Three-Dimensional Speaker Localization: Audio-Refined Visual Scaling Factor Estimation
Abstract
Neither a monocular RGB camera nor a small-size microphone array is capable of accurate three-dimensional (3D) speaker localization. By taking advantage of visual object detection and audio-visual complementary sensor fusion, we formulate the 3D speaker localization problem as a scaling factor estimation problem. As a result, we effectively reduce traditional audio-only 3D speaker localization from an exhaustive 3D grid search to a one-dimensional (1D) optimization problem. We propose a multi-modal perception system with two approaches. We show that the proposed methods are effective, accurate, robust against interference and, as corroborated by indicative empirical results on a real dataset, competitive with conventional uni-modal state-of-the-art methods.
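The reduction from a 3D grid search to a 1D search can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not describe their two approaches, so the sketch assumes a pinhole camera model, microphone positions known in the camera frame, and a least-squares match on time differences of arrival (TDOAs). All function names, the camera intrinsics, the microphone layout, and the search range are hypothetical. The visual detection fixes a viewing ray, and the audio measurements are used only to pick the scale along that ray.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed room-temperature value


def pixel_to_ray(u, v, fx, fy, cx, cy):
    """Back-project a detected pixel to a unit direction (assumed pinhole model)."""
    x, y, z = (u - cx) / fx, (v - cy) / fy, 1.0
    n = math.sqrt(x * x + y * y + z * z)
    return (x / n, y / n, z / n)


def predicted_tdoas(point, mics, pairs):
    """TDOA in seconds for each microphone pair, given a candidate source point."""
    dists = [math.dist(point, m) for m in mics]
    return [(dists[i] - dists[j]) / SPEED_OF_SOUND for i, j in pairs]


def tdoa_cost(s, ray, mics, pairs, measured):
    """Squared mismatch between measured TDOAs and those predicted at s * ray."""
    point = tuple(s * c for c in ray)
    pred = predicted_tdoas(point, mics, pairs)
    return sum((p - m) ** 2 for p, m in zip(pred, measured))


def estimate_scale(ray, mics, pairs, measured, s_min=0.3, s_max=5.0, steps=2000):
    """1D grid search over the scaling factor s (hypothetical range, in metres)."""
    best_s, best_cost = s_min, float("inf")
    for k in range(steps):
        s = s_min + (s_max - s_min) * k / (steps - 1)
        cost = tdoa_cost(s, ray, mics, pairs, measured)
        if cost < best_cost:
            best_s, best_cost = s, cost
    return best_s


# Synthetic, noise-free check: four microphones offset from the camera origin
# (hypothetical layout, coordinates in metres, expressed in the camera frame).
MICS = [(0.25, 0.0, 0.0), (0.35, 0.0, 0.0), (0.30, 0.08, 0.0), (0.30, 0.0, 0.08)]
PAIRS = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]

ray = pixel_to_ray(400.0, 260.0, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
true_scale = 2.0
measured = predicted_tdoas(tuple(true_scale * c for c in ray), MICS, PAIRS)

estimated_scale = estimate_scale(ray, MICS, PAIRS, measured)
speaker_xyz = tuple(estimated_scale * c for c in ray)
```

In this sketch the search visits a few thousand candidates along one ray, whereas a grid of the same resolution over a 3D volume would need that number cubed. With real recordings the measured TDOAs would come from a cross-correlation estimator and would be noisy, so the recovered scale would be correspondingly less precise.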
Similar resources
Audio-Visual Clustering for Multiple Speaker Localization
We address the issue of identifying and localizing individuals in a scene that contains several people engaged in conversation. We use a human-like configuration of sensors (binaural and binocular) to gather both auditory and visual observations. We show that the identification and localization problem can be recast as the task of clustering the audio-visual observations into coherent groups. W...
Audio-visual speaker localization for car navigation systems
Human-computer interaction for in-vehicle information and navigation systems is a challenging problem because of the diverse and changing acoustic environments. It is proposed that the integration of video and audio information can significantly improve dialog system performance, since the visual modality is not impacted by acoustic noise. In this paper, we propose a robust audio-visual integra...
AV16.3: An Audio-Visual Corpus for Speaker Localization and Tracking
Assessing the quality of a speaker localization or tracking algorithm on a few short examples is difficult, especially when the groundtruth is absent or not well defined. One step towards systematic performance evaluation of such algorithms is to provide time-continuous speaker location annotation over a series of real recordings, covering various test cases. Areas of interest include audio, vi...
Two- and Three-Dimensional Audio-Visual Speech Synthesis
An audio-visual speech synthesiser has been built that will generate animated computer-graphics displays of high-resolution, colour images of a speaker's mouth area. The visual displays can simulate the movements of the lower face of a talker for any spoken sentence of British English, given a text input. The synthesiser is based on a data-driven technique. It uses encoded, video-recorded image...
Audio-visual speaker conversion using prosody features
The article presents a joint audio-video approach towards speaker identity conversion, based on statistical methods originally introduced for voice conversion. Using the experimental data from the 3D BIWI Audiovisual corpus of Affective Communication, mapping functions are built between each two speakers in order to convert speaker-specific features: speech signal and 3D facial expressions. The...
Journal
Journal title: IEEE Signal Processing Letters
Year: 2021
ISSN: 1558-2361, 1070-9908
DOI: https://doi.org/10.1109/lsp.2021.3092959